Introduction to Open Data Science - Course Project

Chapter 1: Introduction

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

Hi, I am Garima. In the previous period I was a student of Quantitative Skills, and that is where I heard about this course. According to my understanding, this course will help me explore data more freely and will enhance my ability to wrangle data in a more professional and precise manner. As this is an advanced-level course, I truly believe it will also help me with my master's thesis, especially while working out my methodology.

My favorite chapters from the RHDS book are chapters 3, 4 and 5, as they are the ones that truly helped me with my basics. I especially enjoy using the mutate function and visualizing data with different plots.

Git Repository

My Git Repository link : https://github.com/garimahelsinki/IODS-project.git

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Fri Dec  9 16:11:27 2022"

The text continues here.


Chapter 2: Multiple Regression Analysis

Describe the work you have done this week and summarize your learning.

date()
## [1] "Fri Dec  9 16:11:27 2022"

Here we go again…

setwd("C:/Users/Aadhar/Desktop/Uni/Open Data Science/data")

library(readr)
Data2 <- read_csv("C:/Users/Aadhar/Desktop/Uni/Open Data Science/data/learning2014.csv",show_col_types = FALSE)
Data2
## # A tibble: 166 × 7
##    gender   age attitude  deep  stra  surf points
##    <chr>  <dbl>    <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1 F         53      3.7  3.58  3.38  2.58     25
##  2 M         55      3.1  2.92  2.75  3.17     12
##  3 F         49      2.5  3.5   3.62  2.25     24
##  4 M         53      3.5  3.5   3.12  2.25     10
##  5 M         49      3.7  3.67  3.62  2.83     22
##  6 F         38      3.8  4.75  3.62  2.42     21
##  7 M         50      3.5  3.83  2.25  1.92     21
##  8 F         37      2.9  3.25  4     2.83     31
##  9 M         37      3.8  4.33  4.25  2.17     24
## 10 F         42      2.1  4     3.5   3        26
## # … with 156 more rows

Observation: The data contains 166 rows and 7 columns: gender, age, attitude, deep, stra, surf, and points.

Visualizations with ggplot2

# Access the ggplot2 library
library(ggplot2)

# initialize plot with data and aesthetic mapping
p1 <- ggplot(Data2, aes(x = attitude, y = points, col = gender))


# define the visualization type (points)
p2 <- p1 + geom_point()

# draw the plot
p2

# add a regression line
p3 <- p2 + geom_smooth(method = "lm")

p3
## `geom_smooth()` using formula 'y ~ x'

# add a main title
p4 <- p3 + ggtitle("Student's attitude versus exam points")

# draw the plot!

p4
## `geom_smooth()` using formula 'y ~ x'

Exploring data frame

pairs(Data2[-1])

# access the GGally and ggplot2 libraries
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# create a more advanced plot matrix with ggpairs()

p <- ggpairs(Data2, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

#draw the plot

p

Interpretation of the correlation plot above: In the plot, the variable names are displayed on the outer edges of the matrix. The boxes along the diagonal display the density plot for each variable, the boxes in the lower-left corner display the scatterplot for each pair of variables, and the boxes in the upper-right corner display the Pearson correlation coefficient for each pair.

The Pearson correlation measures the linear relationship between two variables. It takes a value between -1 and 1, where -1 signifies a total negative linear correlation, 0 signifies no linear correlation, and +1 signifies a total positive linear correlation.
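As a quick illustration (a sketch using R's built-in mtcars data rather than the course data, so the numbers are only for demonstration), the Pearson coefficient can be computed directly with cor():

```r
# Pearson correlation between horsepower and fuel efficiency in mtcars;
# more powerful cars tend to use more fuel, so we expect a negative value
r <- cor(mtcars$hp, mtcars$mpg, method = "pearson")
r
```

A value close to -1 here would indicate a strong negative linear relationship, just as described above.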

We can see that deep has almost no correlation with points or age. surf, in turn, correlates negatively with most of the other variables, but none of these correlations are strong, so a clear linear relationship is missing.

Simple regression

# a scatter plot of points versus attitude
library(ggplot2)
qplot(attitude, points, data = Data2) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

# fit a linear model
my_model <- lm(points ~ attitude, data = Data2)

# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude, data = Data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Interpretation: The p-value suggests that there is a statistically significant relationship between points and attitude.

Multiple regression

library(GGally)
library(ggplot2)
# create an plot matrix with ggpairs()
ggpairs(Data2, lower = list(combo = wrap("facethist", bins = 20)))

# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra, data = Data2)

# print out a summary of the model
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = Data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09
# add surf as a third explanatory variable

my_model3 <- lm(points ~ attitude + stra + surf, data = Data2)

#printing out a summary of the model
summary(my_model3)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = Data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Interpretation of multiple regression: A one-unit increase in attitude is associated with a significant increase in points. In contrast, the coefficients for stra and surf are not statistically significant (p = 0.117 and p = 0.466, respectively), so there is no clear evidence that they affect points once attitude is in the model.

Graphical model validation

# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra, data = Data2)

# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5

plot(my_model2, c(1,2,5))

Interpretation of residual models

1. Residuals vs Fitted: Here we see that linearity seems to hold reasonably well, as the red line is close to the dashed line. We can also note some heteroskedasticity: as we move to the right on the x-axis, the spread of the residuals seems to increase. Finally, points 145, 56, and 35 may be outliers, with large residual values.

2. Normal Q-Q plot: In the Normal Q-Q plot, the quantile values of the standard normal distribution are plotted on the x-axis and the corresponding quantile values of the dataset are plotted on the y-axis. Most of the points fall close to the 45-degree reference line, so the residuals are approximately normally distributed, but there are some discrepancies at both ends of the plot, especially for points 145, 35, and 56.

3. Residuals vs Leverage plot: Here we look at how the spread of the standardized residuals changes as the leverage, i.e. the sensitivity of the fitted value ŷ_i to a change in y_i, increases. Points with high leverage may be influential. To assess this we can look at Cook's distance, which measures the effect of deleting a point on the fitted parameter vector. Cook's distance is shown by the dotted red line, and points outside it have high influence. In this case there are no such influential points.
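The same influence check can also be done numerically. As a small sketch (illustrated on the built-in mtcars data, not the course data), cooks.distance() returns Cook's distance for every observation, and a common rule of thumb flags points above 4/n:

```r
# fit a simple linear model on built-in data
fit <- lm(mpg ~ wt, data = mtcars)

# Cook's distance for each observation
cd <- cooks.distance(fit)

# flag potentially influential points with the rough 4/n rule of thumb
influential <- which(cd > 4 / nrow(mtcars))
names(influential)
```

The dotted line in plot 5 of plot.lm() conveys the same information graphically.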

Making predictions

# Create model object m
m <- lm(points ~ attitude, data = Data2)

# print out a summary of the model
summary(m)
## 
## Call:
## lm(formula = points ~ attitude, data = Data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
# New observations
new_attitudes <- c("Mia" = 3.8, "Mike"= 4.4, "Riikka" = 2.2, "Pekka" = 2.9)
new_data <- data.frame(attitude = new_attitudes)

# Print out the new data
summary(new_data)
##     attitude    
##  Min.   :2.200  
##  1st Qu.:2.725  
##  Median :3.350  
##  Mean   :3.325  
##  3rd Qu.:3.950  
##  Max.   :4.400
# Predict the new students exam points based on attitude
predict(m, newdata = new_data) 
##      Mia     Mike   Riikka    Pekka 
## 25.03390 27.14918 19.39317 21.86099

Interpretation: The predict() function applies the fitted regression line to the new students' attitude scores, so each one-unit increase in attitude raises the predicted exam points by about 3.53, the slope of the model.
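As a sanity check (using the rounded coefficients printed in the model summary above, so the result differs slightly from predict()'s full-precision output), Mia's prediction can be reproduced by hand from the fitted line points = 11.6372 + 3.5255 × attitude:

```r
# rounded coefficients from the model summary
intercept <- 11.6372
slope <- 3.5255

# Mia's attitude score was 3.8
mia_points <- intercept + slope * 3.8
mia_points  # close to the 25.03 returned by predict()
```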

End of chapter


Chapter 3: Logistic Regression Analysis

date()
## [1] "Fri Dec  9 16:11:51 2022"

About the Data set

The data was collected for the following paper: P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.

The secondary-school student achievement of two Portuguese schools is the focus of these data. The variables were gathered using school reports and questionnaires and include student grades and demographic, social, and school-related features. Two datasets are supplied, one for performance in Mathematics (mat) and one for Portuguese language (por). The two datasets were modeled in [Cortez and Silva, 2008] under binary/five-level classification and regression tasks. Note that the target attribute G3 correlates strongly with G2 and G1: G3 is the final-year grade (given in the third period), while G1 and G2 are the grades for the first and second periods, respectively.

Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:

1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira )

2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)

3 age - student’s age (numeric: from 15 to 22)

4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)

5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)

6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)

7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)

12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)

13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)

16 schoolsup - extra educational support (binary: yes or no)

17 famsup - family educational support (binary: yes or no)

18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)

19 activities - extra-curricular activities (binary: yes or no)

20 nursery - attended nursery school (binary: yes or no)

21 higher - wants to take higher education (binary: yes or no)

22 internet - Internet access at home (binary: yes or no)

23 romantic - with a romantic relationship (binary: yes or no)

24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)

26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)

27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)

28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)

29 health - current health status (numeric: from 1 - very bad to 5 - very good)

30 absences - number of school absences (numeric: from 0 to 93)

These grades are related with the course subject, Math or Portuguese:

31 G1 - first period grade (numeric: from 0 to 20)

32 G2 - second period grade (numeric: from 0 to 20)

33 G3 - final grade (numeric: from 0 to 20, output target)

Reading data

setwd("C:/Users/Aadhar/Desktop/Uni/Open Data Science/data")

library(readr)
alc <- read_csv("C:/Users/Aadhar/Desktop/Uni/Open Data Science/data/alc.csv",show_col_types = FALSE)
alc
## # A tibble: 370 × 41
##    school sex     age address famsize Pstatus  Medu  Fedu Mjob     Fjob   reason
##    <chr>  <chr> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr>    <chr>  <chr> 
##  1 GP     F        18 U       GT3     A           4     4 at_home  teach… course
##  2 GP     F        17 U       GT3     T           1     1 at_home  other  course
##  3 GP     F        15 U       LE3     T           1     1 at_home  other  other 
##  4 GP     F        15 U       GT3     T           4     2 health   servi… home  
##  5 GP     F        16 U       GT3     T           3     3 other    other  home  
##  6 GP     M        16 U       LE3     T           4     3 services other  reput…
##  7 GP     M        16 U       LE3     T           2     2 other    other  home  
##  8 GP     F        17 U       GT3     A           4     4 other    teach… home  
##  9 GP     M        15 U       LE3     A           3     2 services other  home  
## 10 GP     M        15 U       GT3     T           3     4 other    other  home  
## # … with 360 more rows, and 30 more variables: guardian <chr>,
## #   traveltime <dbl>, studytime <dbl>, schoolsup <chr>, famsup <chr>,
## #   activities <chr>, nursery <chr>, higher <chr>, internet <chr>,
## #   romantic <chr>, famrel <dbl>, freetime <dbl>, goout <dbl>, Dalc <dbl>,
## #   Walc <dbl>, health <dbl>, failures <dbl>, paid <chr>, absences <dbl>,
## #   G1 <dbl>, G2 <dbl>, G3 <dbl>, alc_use <dbl>, high_use <lgl>,
## #   probability <dbl>, prediction <lgl>, prob2 <dbl>, prob3 <lgl>, …

Observation: The data set, named here "alc", has 370 observations of 41 variables: the 35 variables of the joined source data plus derived columns such as alc_use and high_use and the probability/prediction columns added in earlier runs.

Visualising data through plots

# access the tidyverse libraries tidyr, dplyr, ggplot2
library(tidyr); library(dplyr); library(ggplot2)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# glimpse at the alc data
glimpse(alc)
## Rows: 370
## Columns: 41
## $ school        <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "G…
## $ sex           <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "…
## $ age           <dbl> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, …
## $ address       <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "…
## $ famsize       <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", …
## $ Pstatus       <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "…
## $ Medu          <dbl> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3,…
## $ Fedu          <dbl> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2,…
## $ Mjob          <chr> "at_home", "at_home", "at_home", "health", "other", "ser…
## $ Fjob          <chr> "teacher", "other", "other", "services", "other", "other…
## $ reason        <chr> "course", "course", "other", "home", "home", "reputation…
## $ guardian      <chr> "mother", "father", "mother", "mother", "father", "mothe…
## $ traveltime    <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1,…
## $ studytime     <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1,…
## $ schoolsup     <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no",…
## $ famsup        <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "ye…
## $ activities    <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", …
## $ nursery       <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "…
## $ higher        <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", …
## $ internet      <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "ye…
## $ romantic      <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "…
## $ famrel        <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5,…
## $ freetime      <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5,…
## $ goout         <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5,…
## $ Dalc          <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,…
## $ Walc          <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4,…
## $ health        <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5,…
## $ failures      <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,…
## $ paid          <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes…
## $ absences      <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9,…
## $ G1            <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, …
## $ G2            <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15,…
## $ G3            <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16…
## $ alc_use       <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1…
## $ high_use      <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ probability   <dbl> 0.2699557, 0.2894543, 0.2291836, 0.3126444, 0.2858545, 0…
## $ prediction    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ prob2         <dbl> 0.3555814, 0.2953958, 0.2997191, 0.3040783, 0.2822524, 0…
## $ prob3         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ probability_2 <dbl> 0.1849034, 0.1533271, 0.1849034, 0.2569653, 0.2001637, 0…
## $ prediction_2  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
# use gather() to gather columns into key-value pairs and then glimpse() at the resulting data
gather(alc) %>% glimpse
## Rows: 15,170
## Columns: 2
## $ key   <chr> "school", "school", "school", "school", "school", "school", "sch…
## $ value <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP"…
# it may help to take a closer look by View() and browse the data
gather(alc) %>% View

# draw a bar plot of each variable
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

Above is a bar plot of each variable in the data set, showing how each variable is distributed.

About my 4 chosen variables

I am choosing the following variables:

schoolsup - extra educational support (binary: yes or no)

higher - wants to take higher education (binary: yes or no)

famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

G3 - final grade (numeric: from 0 to 20, output target)

The variable ‘schoolsup’ will help me understand the impact of high alcohol use on how much support a student needs in their studies. My hypothesis is that with increased alcohol use, the support needed will also increase.

With the help of the variable ‘higher’, I want to study whether the willingness to pursue higher studies differs between those who consume more alcohol and the rest. My hypothesis is that increased alcohol consumption will have an effect on the desire to pursue higher education, but not a significant one.

The ‘famrel’ variable will help me understand the effect of high alcohol usage on the relationship with family. My hypothesis is that high consumption will strain the relationship with family.

The final grade variable (G3) will help me see whether the final grades of students with high usage differ from those with low usage. My hypothesis is that grades will be negatively affected by high alcohol consumption.

Exploring the chosen variables through visualization

1. Exploring high_use with schoolsup, i.e., extra support needed from school

# access the tidyverse libraries dplyr and ggplot2
library(dplyr); library(ggplot2)

g1 <- ggplot(alc, aes(x = high_use))

g1 + geom_bar() + facet_wrap("schoolsup") + ggtitle("Effect of alcohol consumption on Educational support required") + xlab("High alcohol usage") + aes(fill = high_use)

Observation: From the visualization above we can observe that there are comparatively few students whose alcohol usage is high and who require extra educational support. Interestingly, students whose alcohol usage is not high seek more support from the school than those with high usage. My hypothesis is therefore false, which is very interesting to me.

2. Exploring high_use with higher, i.e., desire for higher education

g2 <- ggplot(alc, aes(x = high_use))

g2 + geom_bar() + facet_wrap("higher") + ggtitle("Desire to take higher education vs high alcohol usage") + xlab("High alcohol usage") + aes(fill = high_use)

Observation: This variable supports my hypothesis: regardless of the level of alcohol consumption, the desire for higher education does not go away. Let's take a closer look below.

Summarising to see the results more clearly

alc %>% group_by(higher, high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'higher'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups:   higher [2]
##   higher high_use count
##   <chr>  <lgl>    <int>
## 1 no     FALSE        7
## 2 no     TRUE         9
## 3 yes    FALSE      252
## 4 yes    TRUE       102

Observation: There are only 16 students who do not wish to pursue higher education, of whom 9 have high usage. This suggests there is no significant relationship between alcohol usage and the desire to study further, which seems logical to me: drinking habits can be temporary, while the choice to continue studying is a thought-through decision for most people.

3. Exploring high_use and its effect on famrel, i.e., family relationships

g3 <- ggplot(alc, aes(x = high_use))

g3 + geom_bar() + facet_wrap("famrel") + ggtitle("family relationships (1-5) vs high alcohol usage") + xlab("High alcohol usage") + aes(fill = high_use)

Observation: Family relationships for most students are on the positive side, without any clear difference between high and low alcohol usage. In fact, in the bad-relationship categories the number of students with low usage is higher, but that is expected, as there are more students with low usage than with high usage overall. Therefore, my hypothesis of a negative effect appears false.

4. Exploring high_use and its effect on G3, i.e., final grades

g4 <- ggplot(alc, aes(x = high_use, y = G3))

g4 + geom_boxplot() + ggtitle("Effect of high alcohol usage on Final grades") + xlab("High alcohol usage") + ylab("Final grade")+ aes(col = high_use)

Observation: We can clearly see that students with high consumption have somewhat lower grades than those with low consumption. The hypothesis for this variable holds: high alcohol consumption is associated with lower final grades.

Logistic Regression Model

About Logistic regression model

Logistic regression is used to predict a class (or category) based on one or more predictor variables (x). It models a binary outcome, i.e. a variable that can take only two possible values.

We will now use a logistic regression model to identify factors related to higher-than-average student alcohol consumption.

# find the model with glm()
m <- glm(high_use ~ schoolsup + higher + famrel + G3, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ schoolsup + higher + famrel + G3, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4957  -0.8206  -0.7292   1.3560   1.9346  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   1.83332    0.74905   2.448   0.0144 *
## schoolsupyes -0.37002    0.35929  -1.030   0.3031  
## higheryes    -0.78379    0.55077  -1.423   0.1547  
## famrel       -0.27320    0.12418  -2.200   0.0278 *
## G3           -0.07269    0.03692  -1.969   0.0490 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 437.36  on 365  degrees of freedom
## AIC: 447.36
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
##  (Intercept) schoolsupyes    higheryes       famrel           G3 
##   1.83331647  -0.37001600  -0.78378621  -0.27320431  -0.07269305

Observation: Looking at the p-values, only famrel (p = 0.028) and G3 (p = 0.049) are statistically significant at the 0.05 level; schoolsupyes (p = 0.30) and higheryes (p = 0.15) are not. All four slope coefficients are negative, which means that better family relationships and higher final grades are associated with lower odds of high alcohol use, while extra school support and the desire for higher education show a negative but non-significant association.

There were 4 Fisher scoring iterations before the model-fitting process converged and the results were output.

Note that the intercept (1.83) is positive; it is the slope coefficients of the explanatory variables that are below zero.

From coefficients to odds ratios(OR)

# find the model with glm()
m <- glm(high_use ~ schoolsup + higher + famrel + G3, data = alc, family = "binomial")

# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR     2.5 %     97.5 %
## (Intercept)  6.2545955 1.4672230 28.1361910
## schoolsupyes 0.6907233 0.3292951  1.3616633
## higheryes    0.4566737 0.1504271  1.3477672
## famrel       0.7609373 0.5955510  0.9708981
## G3           0.9298862 0.8644588  0.9995539

Observation: The odds ratios show that, holding the other variables constant, a one-unit increase in famrel multiplies the odds of high alcohol use by 0.76 (a decrease of about 24%), and each additional final-grade point multiplies the odds by 0.93 (a decrease of about 7%). The odds ratios for schoolsupyes (0.69) and higheryes (0.46) also point downward, but their confidence intervals include 1, so these associations are not statistically significant.
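The link between a log-odds coefficient and its odds ratio can be checked by hand; a minimal sketch using the famrel coefficient printed by coef(m) above:

```r
# log-odds coefficient for famrel from the model output
b_famrel <- -0.27320431

# exponentiating a logistic-regression coefficient gives the odds ratio
or_famrel <- exp(b_famrel)
or_famrel              # about 0.761, matching the OR table

# percentage change in the odds per one-unit increase in famrel
100 * (or_famrel - 1)  # about -24%
```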

Plot visualisation with confidence intervals

library(finalfit)
dependent <- "high_use"
explanatory <- c("schoolsup", "higher", "famrel", "G3")
alc %>% 
  or_plot(dependent, explanatory,
          breaks = c(0.5, 1, 2, 5),
          table_text_size = 3.5,
          title_text_size = 16)
## Waiting for profiling to be done...
## Waiting for profiling to be done...
## Waiting for profiling to be done...
## Warning: Removed 2 rows containing missing values (geom_errorbarh).

The results of the plot above match the observations.

Explore the predictive power of the model with the significant variables

Binary prediction (1)

# fit the model
m <- glm(high_use ~ schoolsup + higher + famrel + G3, data = alc, family = "binomial")
m1 <- glm(high_use ~ famrel + G3, data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m, type = "response")
prob2 <- predict(m1, type = "response")

library(dplyr)

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prob2 = prob2)

alc <- mutate(alc, prediction = probability > 0.5)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prob3 = prob2 > 0.5)

# see the last ten original classes, predicted probabilities, and class predictions
select(alc, schoolsup, higher, famrel, G3, high_use, probability, prediction) %>% tail(10)
## # A tibble: 10 × 7
##    schoolsup higher famrel    G3 high_use probability prediction
##    <chr>     <chr>   <dbl> <dbl> <lgl>          <dbl> <lgl>     
##  1 no        yes         4    13 FALSE          0.271 FALSE     
##  2 no        yes         4     0 FALSE          0.489 FALSE     
##  3 no        yes         5     2 TRUE           0.387 FALSE     
##  4 no        yes         5    12 FALSE          0.233 FALSE     
##  5 no        yes         4     8 FALSE          0.349 FALSE     
##  6 no        yes         5     5 FALSE          0.336 FALSE     
##  7 no        yes         4    12 FALSE          0.286 FALSE     
##  8 no        yes         1     4 FALSE          0.619 TRUE      
##  9 no        yes         2    13 TRUE           0.391 FALSE     
## 10 no        yes         4    10 TRUE           0.316 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   251    8
##    TRUE    103    8

Observation: The model correctly predicted FALSE for 251 of the 259 students who are not high users, and TRUE for only 8 of the 111 students who are. The sensitivity of the model is therefore 8/111 ≈ 7%, which is poor, while the specificity is 251/259 ≈ 97%, which is much better.

In the cross table, 8 times the prediction is heavy alcohol consumption when the true value corresponds to no heavy consumption. Similarly, 103 times the prediction is no heavy consumption when the true value corresponds to heavy consumption.
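These two rates can be computed directly from the counts of the cross table above (a small sketch; the counts are hard-coded here rather than recomputed from the data):

```r
# confusion-matrix counts from the table above
tn <- 251  # correctly predicted not-high-use
fp <- 8    # predicted high use, actually not
fn <- 103  # predicted not-high-use, actually high
tp <- 8    # correctly predicted high use

sensitivity <- tp / (tp + fn)  # true positive rate
specificity <- tn / (tn + fp)  # true negative rate

round(c(sensitivity = sensitivity, specificity = specificity), 3)
```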

# rerun the model and predictions to make sure 'alc' is up to date

m <- glm(high_use ~ schoolsup + higher + famrel + G3, data = alc, family = "binomial")
alc <- mutate(alc, probability = predict(m, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)

Binary prediction (2)

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use))

# define the geom as points and draw the plot
g + geom_point() + aes(col = probability)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67837838 0.02162162 0.70000000
##    TRUE  0.27837838 0.02162162 0.30000000
##    Sum   0.95675676 0.04324324 1.00000000

Observation The table above shows that about 96 percent of all students are predicted not to drink heavily, while in reality about 70 percent do not. The forecast thus clearly differs from the observed values of the variable.

Accuracy and loss functions

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, alc$probability)
## [1] 0.3

Observation The loss function gives a prediction error of 0.3 on the training data.
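To see what loss_func computes, here is a toy check with made-up classes and probabilities (the vectors below are illustrative, not from the alc data). A prediction counts as wrong exactly when the probability lands on the wrong side of 0.5, so the loss is 1 minus the accuracy:

```r
class <- c(TRUE, FALSE, TRUE, FALSE)  # observed high_use (coerced to 1/0)
prob  <- c(0.9,  0.2,   0.3,  0.7)    # predicted probabilities

# mean prediction error: share of predictions on the wrong side of 0.5
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class, prob)  # the last two predictions are wrong
## [1] 0.5
```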

10-fold Cross validation

# K-fold cross-validation
library(boot)

cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)


# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.3162162

Observation Since a lower value means a better model, this is a somewhat poorer result than the exercise result, which was 0.26.

To improve the model, more explanatory variables can be included and the cross-validation recalculated. For this I am adding sex as an explanatory variable, not only because it relates strongly to the target variable but also because it will be interesting to examine the results from the point of view of gender.

n <- glm(high_use ~ schoolsup + higher + famrel + sex, data = alc, family = "binomial")

# Summary of the model
summary(n)
## 
## Call:
## glm(formula = high_use ~ schoolsup + higher + famrel + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4855  -0.8633  -0.6683   1.2409   1.9794  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.7570     0.7207   1.050 0.293510    
## schoolsupyes  -0.0982     0.3675  -0.267 0.789324    
## higheryes     -0.8484     0.5327  -1.593 0.111222    
## famrel        -0.3235     0.1264  -2.559 0.010489 *  
## sexM           0.9137     0.2427   3.765 0.000167 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 426.62  on 365  degrees of freedom
## AIC: 436.62
## 
## Number of Fisher Scoring iterations: 4
# Calculate the odds ratios and their confidence intervals
cbind(exp(coef(n)), exp(confint(n)))
## Waiting for profiling to be done...
##                            2.5 %    97.5 %
## (Intercept)  2.1319589 0.5214475 8.9591639
## schoolsupyes 0.9064662 0.4264470 1.8220117
## higheryes    0.4280849 0.1454536 1.2158699
## famrel       0.7236339 0.5634688 0.9265524
## sexM         2.4935013 1.5567858 4.0385316
# Prediction of the probability
probabilities_2 <- predict(n, type = "response")
# Probabilities to alc
alc <- mutate(alc, probability_2 = probabilities_2)
# Calculate a logical high use value based on probabilities.
alc <- mutate(alc, prediction_2 = probability_2 > 0.5)
# Recalculation of cross-validation and print the average number of wrong predictions.
cv_2 <- cv.glm(data = alc, cost = loss_func, glmfit = n, K = 10)
cv_2$delta[1]
## [1] 0.2837838

Interpretation The new model's cross-validation error of 0.2837838 is clearly better than the 0.3162162 obtained for the original model. (The exact values vary between runs because the folds are chosen at random.)

**____End of Chapter___**


Chapter 4: Clustering and classification

date()
## [1] "Fri Dec  9 16:12:15 2022"

About the Data set

The Boston data set has 506 observations and 14 variables. It is one of the data sets that come pre-installed with the MASS package, and it contains information on housing values in the suburbs of Boston.

The variables involved are as follows:

crim - per capita crime rate by town.

zn - proportion of residential land zoned for lots over 25,000 sq.ft.

indus - proportion of non-retail business acres per town.

chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox - nitrogen oxides concentration (parts per 10 million).

rm - average number of rooms per dwelling.

age - proportion of owner-occupied units built prior to 1940.

dis - weighted mean of distances to five Boston employment centres.

rad - index of accessibility to radial highways.

tax - full-value property-tax rate per $10,000.

ptratio - pupil-teacher ratio by town.

black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

lstat - lower status of the population (percent).

medv - median value of owner-occupied homes in $1000s.

Source Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102. Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.

Loading the Data Set, Installing package and Exploring Data set

#Loading required libraries for the Exercise

library(ggplot2)
library(dplyr)
library(corrplot)
## corrplot 0.92 loaded
library(GGally)
library(tidyr)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
#Loading Data Set
data("Boston")

#Exploring the Data Set
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Observation: The data set has 506 observations of 14 variables. The variables chas and rad are integers while the others are numeric. Judging from the Min. and Max. ranges in the summary, the distributions of most variables are skewed, the exceptions being rm and medv.

Graphical Distribution and summary of variables

# plot matrix of the variables
pairs(Boston)

# printing the correlation matrix to look at the correlations between variables in the data
cor_matrix <- cor(Boston) %>% round(digits = 2)

# visualize the correlation matrix
library(corrplot)
corrplot(cor_matrix, method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

Observations

Pair Plot: The pair plot gives a visual representation of the skewed distributions noted above.

Correlation Plot: According to the correlation plot, several variable pairs have a high positive correlation; ‘rad’ and ‘tax’ have the highest positive correlation of all. On the other hand, there are also strong negative correlations, the strongest being between ‘lstat’ and ‘medv’, ‘age’ and ‘dis’, ‘nox’ and ‘dis’, and ‘indus’ and ‘dis’. We can therefore say that ‘dis’ (distance) is involved in the strongest negative correlations of all the variables.

Standardization of Data Set and Categorization of Variable

  1. Standardization of the data set: standardization subtracts the mean of each column from its values and then divides the result by the column’s standard deviation.
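A quick sanity check (on a toy vector, not the Boston data) that scale() really computes (x - mean) / sd:

```r
x <- c(2, 4, 6, 8)

# manual standardization: subtract the mean, divide by the standard deviation
manual <- (x - mean(x)) / sd(x)

# scale() does the same (it returns a 1-column matrix, hence as.vector)
all.equal(as.vector(scale(x)), manual)
## [1] TRUE
```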
# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(scale(Boston))

summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

Observation We can see that all variables now have mean zero and unit standard deviation. Standardization rescales the variables in this way, but it does not by itself make them normally distributed.

2. Creating a categorical variable of the crime rate in the Boston data set

# Create a quantile vector of crim and create the categorical "crime".
bins <- quantile(boston_scaled$crim)

crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c('low','med_low','med_high','high'))

# Remove the original unscaled variable and add the new categorical value to scaled data.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

# Look at the table of the new factor crime
table(boston_scaled$crime)
## 
##      low  med_low med_high     high 
##      127      126      126      127

Observation We have now turned the previously continuous crim variable into a categorical variable, renamed crime, which has been added as a separate variable to the boston_scaled data set. It has 127 observations in the low category, 126 each in med_low and med_high, and 127 in high.
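The quantile-based cut can be illustrated on a toy vector. With quantile() breaks and include.lowest = TRUE, each category gets roughly a quarter of the observations:

```r
x <- 1:100

# quantile() with defaults returns the 0%, 25%, 50%, 75% and 100% points
bins <- quantile(x)

# cut the vector at those points into four labelled categories
cats <- cut(x, breaks = bins, include.lowest = TRUE,
            labels = c("low", "med_low", "med_high", "high"))
table(cats)
## cats
##      low  med_low med_high     high 
##       25       25       25       25
```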

Divide the dataset to train and test sets.

When we want to use a statistical method to predict something, it is important to have data to test how well the predictions fit. Splitting the original data to test and train sets allows us to check how well our model works. The training of the model is done with the train set and prediction on new data is done with the test set. This way you have true classes / labels for the test data, and you can calculate how well the model performed in prediction.

Below, size = n * 0.8 means that 80% of the rows are chosen at random to form the training set.

# Number of rows in the Boston dataset
n <- nrow(boston_scaled) 

# Choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# Create train set
train <- boston_scaled[ind,]

# Create test set
test <- boston_scaled[-ind,] 

# Save the correct classes from the test data
correct_classes <- test$crime

# Remove the crime variable from the test data
test <- dplyr::select(test, -crime)
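Because sample() draws at random, the exact rows in train and test change every time the document is knitted. A seeded sketch (set.seed(2022) is an arbitrary choice of mine, not part of the course code) makes the split reproducible:

```r
set.seed(2022)                    # fix the random number stream

df <- data.frame(x = 1:10)        # toy data frame standing in for boston_scaled
n <- nrow(df)
ind <- sample(n, size = n * 0.8)  # indices of the 80% training rows

train <- df[ind, , drop = FALSE]
test  <- df[-ind, , drop = FALSE]

c(train = nrow(train), test = nrow(test))
## train  test 
##     8     2
```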

Linear Discriminant Analysis on the train set

LDA finds the linear combinations of the variables that best separate the classes of the target variable. For this analysis, we will use crime as the target variable.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2549505 0.2450495 0.2500000 
## 
## Group means:
##                   zn      indus         chas        nox          rm        age
## low       0.97185603 -0.8975877 -0.077423115 -0.8589974  0.47763329 -0.8368750
## med_low  -0.07137359 -0.2795126 -0.004759149 -0.5800971 -0.12295696 -0.4126173
## med_high -0.38156320  0.2178068  0.165126514  0.3733639  0.07475939  0.4376748
## high     -0.48724019  1.0171306 -0.038441925  1.0423077 -0.32554486  0.8076000
##                 dis        rad        tax     ptratio       black       lstat
## low       0.8378674 -0.6953230 -0.7276919 -0.47429402  0.37801491 -0.77119104
## med_low   0.3766310 -0.5525898 -0.4690521 -0.07677478  0.31587871 -0.16639107
## med_high -0.3845620 -0.4285189 -0.3139601 -0.32798934  0.05404188  0.03156696
## high     -0.8501298  1.6379981  1.5139626  0.78062517 -0.78526395  0.87322433
##                  medv
## low       0.554400254
## med_low  -0.003778154
## med_high  0.176990466
## high     -0.715586245
## 
## Coefficients of linear discriminants:
##                  LD1           LD2         LD3
## zn       0.103666467  0.7219175049 -0.98806181
## indus    0.025824507 -0.2952909439  0.35779524
## chas    -0.013744017 -0.0005349013  0.10419244
## nox      0.415172499 -0.7543979411 -1.18755880
## rm       0.038379195 -0.0164059407 -0.15214750
## age      0.231764575 -0.3203092218 -0.35869980
## dis     -0.070804609 -0.3482278252  0.21206282
## rad      3.398490193  0.8211686224 -0.01160039
## tax      0.006211215  0.0935522848  0.44314377
## ptratio  0.116416963  0.0684761770 -0.26826113
## black   -0.092532774  0.0554084371  0.11161289
## lstat    0.201905336 -0.2985533357  0.34230873
## medv     0.077603612 -0.5260772829 -0.24197959
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9543 0.0335 0.0122
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(factor(train$crime))

# plot the lda results
plot(lda.fit, col = classes, pch = classes, dimen = 2)
lda.arrows(lda.fit, myscale = 1)

Observation The proportion of trace tells us that LD1 explains about 95.4% of the between-group variance, LD2 about 3.4%, and LD3 about 1.2%. Through the plot we can see that the target variable crime is clearly separated, and that accessibility to radial highways (‘rad’) is the most influential variable, as shown by its long arrow.

Prediction and Cross tabulation

(Please note: the saving of the correct crime classes and the removal of the categorical crime variable from the test data can be found just before the start of the LDA exercise, in the “Divide the dataset to train and test sets” section.)

# Predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# Cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       15      11        0    0
##   med_low    1      15        7    0
##   med_high   1       5       19    2
##   high       0       0        0   26

Observation The cross tabulation suggests that the model predicts the ‘high’ crime category very well. Out of 102 test observations, the majority are predicted into their correct category, making the model usable for rough predictions.
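The overall accuracy can be read off the table as the share of the diagonal. A sketch using the counts printed above (rebuilt by hand, so it matches this particular random split only):

```r
# cross table of correct vs predicted classes, copied from the output above
cross <- matrix(c(15,  1,  1,  0,   # predicted low
                  11, 15,  5,  0,   # predicted med_low
                   0,  7, 19,  0,   # predicted med_high
                   0,  0,  2, 26),  # predicted high
                nrow = 4,
                dimnames = list(
                  correct   = c("low", "med_low", "med_high", "high"),
                  predicted = c("low", "med_low", "med_high", "high")))

# accuracy: correctly classified (diagonal) over all test observations
sum(diag(cross)) / sum(cross)
## [1] 0.7352941
```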

Reload and standardize the Boston dataset

# Reload Boston dataset.
data("Boston")

# center and standardize variables
boston_scaled <- scale(Boston)

# change the object to data frame from matrix type.
boston_scaled <- as.data.frame(boston_scaled)

Calculate the distance between Observations, K means clustering and visualisation

# Calculate the Euclidean distances between observations.
dist_eu <- dist(boston_scaled, method = "euclidean")

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

Observation The most commonly used distance measure is the Euclidean distance. Here the distances range from 0.1343 to 14.3970, with a mean of 4.9111 and a median of 4.8241.
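As a reminder of what dist() computes: the Euclidean distance between two points is the square root of the summed squared coordinate differences. A toy check with two points:

```r
a <- c(0, 3)
b <- c(4, 0)

# dist() returns the pairwise distances of the rows of its input
d <- dist(rbind(a, b), method = "euclidean")

as.numeric(d)  # sqrt((0 - 4)^2 + (3 - 0)^2) = sqrt(25)
## [1] 5
```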

# K-means clustering
km <- kmeans(boston_scaled, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled[6:10], col = km$cluster)

Observation I have plotted five of the columns (columns 6 to 10) and defined k-means with 3 clusters.

We will now find the optimal number of clusters.

# Investigate optimal number of clusters and rerun algorithm
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# Visualize the cluster.
qplot(x = 1:k_max, y = twcss, geom = 'line')

Observation We can see that the total within-cluster sum of squares (WCSS) drops radically at first; the sharpest drop occurs when moving to two clusters, after which the curve levels off.

Let us now check how the clustering looks with the optimal number, 2.

# k-means clustering
km <-kmeans(dist_eu, centers = 2) 

# Plot the clusters
pairs(boston_scaled[6:10], col = km$cluster)

Observation Compared with the previous 3-cluster solution there is not much difference, but going by the WCSS results, two seems to be the optimal number of clusters for this data when the Euclidean distance is used.

Super Bonus: Create a 3-D plot

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

library(plotly)
## Warning: package 'plotly' was built under R version 4.2.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# Plot.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
# k-means with 4 clusters to compare the methods.
km3D <-kmeans(boston_scaled, centers = 4)


#Plot with km cluster as color of data
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km3D$cluster[ind])

Observation

Through the above super cool 3D plots, my observation is that the k-means clustering works better here. In the crime plot, it was hard to establish where the observations fall, as some of the classes overlap considerably.

**____End of chapter___**


Chapter 5: Dimensionality reduction techniques

date()
## [1] "Fri Dec  9 16:12:29 2022"

For this assignment, we will use two data sets, namely the ‘human’ data set and the ‘tea’ data set.

Description of the Data

setwd("C:/Users/Aadhar/Desktop/Uni/Open Data Science/data")

library(readr)

#Loading the human data set
human <- read.table("C:/Users/Aadhar/Desktop/Uni/Open Data Science/data/human.txt",sep="\t")

dim(human)
## [1] 155   8
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ SecEd_FtoM     : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ LFPR_FtoM      : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ ExpEduYears    : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ LifeExpectancy : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNIC           : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ MMR            : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ ABR            : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ RepinParliament: num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
View(human)

human: There are 155 countries in the human dataframe with 8 variables. The variables are:

SecEd_FtoM: Ratio of females to males educated at secondary and higher education levels

LFPR_FtoM: Ratio of females to males labour market participation rate (LFPR)

ExpEduYears: Expected years of schooling

LifeExpectancy: Life expectancy at birth (years)

GNIC: Gross national income per capita (GNI)

MMR: Maternal mortality ratio (MMR) (deaths per 100 000 live births)

ABR: Adolescent birth rate (ABR) (births per 1 000 women ages 15-19)

RepinParliament: Share of parliamentary seats held by women (% of seats)

Graphical Overview of Data Set human

library(ggplot2)
library(dplyr)
library(corrplot)
library(GGally)
library(tidyr)

summary(human)
##    SecEd_FtoM       LFPR_FtoM       ExpEduYears    LifeExpectancy 
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNIC             MMR              ABR         RepinParliament
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
p <- ggpairs(human, lower = list(combo = wrap("facethist", bins = 20)))
p

#Exploring relationship between variables

cor_matrix <- cor(human) %>% round(digits = 2)
corrplot(cor_matrix, method="circle", type="upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

Observation ggpairs plot: The distributions of the variables are skewed. The scatter plots suggest there are outliers, which can also be seen from the summary: several variables have a very wide range, such as GNIC (Min. 581, Max. 123124) and MMR (Min. 1, Max. 1100).

Corrplot: According to the correlation plot, the variables MMR and ABR have a strong positive correlation, and ExpEduYears and LifeExpectancy also correlate positively. On the other hand, LifeExpectancy and MMR share a strong negative correlation. Interestingly, there is essentially no correlation, positive or negative, between GNIC and LFPR_FtoM, or between LFPR_FtoM and SecEd_FtoM. Besides this, RepinParliament shows only weak correlations, positive or negative, with all the other variables.

#Histogram for better visualization of variables

human %>% 
  gather(key=var_name, value = value) %>% 
  ggplot(aes(x=value)) +
  geom_histogram() +
  facet_wrap(~var_name, scales = "free_x")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Principal component analysis (PCA)

PCA on the raw (non-standardized) human data

# perform principal component analysis (with the SVD method)
pcaHuman <- prcomp(human)

# print out a summary of PCA. One gets quite a few warnings. The first component explains a whopping 99.99 % of the variance.
s <- summary(pcaHuman)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)

# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot of the principal component representation and the original variables using the first 2 components. GNIC explains looks to explain pretty much all of the first principal component.
biplot(pcaHuman, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], xlim = c(-0.5, 0.2))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Observation According to the above table, PC1 captures essentially all of the variance: its proportion of variance is 9.999e-01 while the proportions for the other components are practically 0.

Observations for the bi-plot: Because PCA treats variables with larger variance as more significant, it is sensitive to the relative scales of the original variables. GNIC stands apart from the other variables because it is on a completely different scale, and countries such as Chad, Sierra Leone, Qatar, and the Central African Republic appear as outliers in the scatter plot.
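This scale sensitivity can be demonstrated on a toy data frame with two uncorrelated columns, one of which is on a scale a thousand times larger than the other (the column names and seed below are arbitrary choices of mine):

```r
set.seed(1)
toy <- data.frame(small = rnorm(100), big = rnorm(100) * 1000)

# raw PCA: the large-scale column soaks up essentially all the variance
round(summary(prcomp(toy))$importance[2, ], 3)

# scaled PCA: the two columns contribute roughly equally
round(summary(prcomp(toy, scale. = TRUE))$importance[2, ], 3)
```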

Standardize, Repeat and Interpret both the analysis

# standardize the variables
human_std <- scale(human)

# perform principal component analysis (with the SVD method)
pca_human_std1 <- prcomp(human_std)

s2 <- summary(pca_human_std1)
s2
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
# Percentages of variance for the plot titles
pr_pca <- round(100*s2$importance[2, ], digits = 1)

pr_pca
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
pc_lab2 <- paste0(names(pr_pca), " (", pr_pca, "%)")
biplot(pca_human_std1, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"), xlab = pc_lab2[1], ylab = pc_lab2[2])

Observation PC1 explains 53.6% and PC2 16.2% of the variance of the eight variables. In the unstandardized analysis, variables with larger variance were treated as more important, which is why GNIC had such a distinct representation in the previous plot; now all the variables contribute more evenly. The two analyses differ because of the different scales of the variables, and the zero-length-arrow warnings of the raw model are also gone from this plot.

Personal Interpretations based on PCA biplots

Standardization makes it simpler to see the relationships between the variables. Based on the biplot, the first component PC1 represents variables relating to welfare and living standards, and the second component PC2 relates to equality issues. These two groups of issues do not appear to be correlated according to the plot.

After scaling, the biplots are unquestionably simpler to read because the various variables are on comparable scales rather than drastically different ones. I anticipate that PC1 mostly refers to the amount of resources invested in people, such as education and medical care. PC2 may be a type of equality metric that assesses how successfully women can participate in the workforce as opposed to staying at home.

It’s remarkable that these two aspects are different given that the socioeconomic level of women appears to be directly tied to reproduction. One interpretation is that while addressing concerns with reproductive rights is helpful, issues with women’s status remain a component of issues that require attention.

The tea data

About the data set The tea data comes from the FactoMineR package. It was collected with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what their perception of the product is (12 questions). In addition, some personal details were asked (4 questions).

#Load the tea dataset and convert its character variables to factors

library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.2.2
data(tea)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

Observation As stated above, the tea data has 300 observations of 36 variables. All variables are factors except age, which is stored as an integer.

Multiple Correspondence Analysis (MCA) on the tea data

Creating a new data set tea2 from the variables (listed in keep2) that describe the way and time preferences people have for drinking tea

library(dplyr)
library(tidyr)
library(ggplot2)  # needed for the ggplot() call below

# column names to keep in the dataset
keep2 <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep2' to create a new dataset
tea2 <- dplyr::select(tea, one_of(keep2))

# visualize the dataset
gather(tea2) %>% ggplot(aes(value)) + 
facet_wrap("key", scales = "free") + 
geom_bar() + 
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

Observation From the visualization above we can see that most people use tea bags. More people preferred to have their tea plain (alone) and not at lunch. Earl Grey was the most popular tea, and while sugar got a mixed response, slightly more people took their tea without it. Interestingly, people mostly bought their tea from chain stores rather than tea shops.

MCA for data set tea

mca <- MCA(tea2)

summary(mca)
## 
## Call:
## MCA(X = tea2) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |

Observation The summary shows that the dimensions do not account for a sizable fraction of the total variance; the first two dimensions together explain only about 29.5%. There are no overwhelmingly strong relationships between the dimensions and the variables.

Visualization of MCA Plot

# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

Observation The plot shows that the variables how and where are strongly linked to Dim 1 (15.24% of the variance) and also to Dim 2 (14.23%), as the eta2 table above confirms. Several other variables are related to both dimensions more weakly.

Dimensions 1 and 2 correspond mostly to the package people buy their tea in, where they buy it, and the price of the tea. There is also some correspondence with the kind of tea drunk and whether sugar is added.

Based on the plot, the responses tea shop, unpackaged, green and alone are strongly related to Dim 1, whereas chain store+tea shop and tea bag+unpackaged are strongly related to Dim 2, with many other responses falling in between.
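The coordinates and contributions behind these statements can be checked directly from the MCA result object (a sketch; var$coord and var$contrib are standard components of a FactoMineR MCA result):

```r
# Category coordinates on the first two dimensions
round(mca$var$coord[, 1:2], 2)

# Which categories contribute most to each dimension
round(mca$var$contrib[, 1:2], 2)
```

Categories with large absolute coordinates and large contributions on a dimension are the ones driving that dimension's interpretation.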

—–End of Chapter 5—–


Chapter 6: Analysis of longitudinal data

date()
## [1] "Fri Dec  9 16:12:49 2022"

Part 1 of this exercise analyzes the RATS data, while Part 2 analyzes the BPRS data. The goal of this exercise is to analyze the datasets using statistical techniques appropriate for longitudinal data. All exercises of this week are based on Chapters 8 and 9 of Vehkalahti and Everitt (2019), included in the special edition MABS4IODS (Part VI).

About Data Sets

BPRS Data In the BPRS data, 40 male subjects were randomly assigned to one of two treatment groups, and each subject was rated on the brief psychiatric rating scale (BPRS) before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.

RATS Data This data set comes from a nutrition study conducted in three groups of rats. The groups were put on different diets, and each animal’s body weight (grams) was recorded repeatedly, approximately weekly (except in week seven, when two recordings were taken), over a 9-week period. The question of most interest is whether the growth profiles of the three groups differ.

Part 1: Implement the analyses of Chapter 8 of MABS- RATS data

Reading Data

#Applying Packages
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8     ✔ stringr 1.4.1
## ✔ purrr   0.3.4     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ plotly::filter() masks dplyr::filter(), stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ plotly::select() masks MASS::select(), dplyr::select()
library(ggplot2)
library(readr)
library(dplyr)
library(tidyr)
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
#Loading data
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep  ='\t')

Converting Data to Long form

# Convert the categorical variables of data set to factors
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)

glimpse(RATS)
## Rows: 16
## Columns: 13
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ WD1   <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470,…
## $ WD8   <int> 250, 230, 250, 255, 260, 265, 275, 255, 415, 420, 445, 560, 465,…
## $ WD15  <int> 255, 230, 250, 255, 255, 270, 260, 260, 425, 430, 450, 565, 475,…
## $ WD22  <int> 260, 232, 255, 265, 270, 275, 270, 268, 428, 440, 452, 580, 485,…
## $ WD29  <int> 262, 240, 262, 265, 270, 275, 273, 270, 438, 448, 455, 590, 487,…
## $ WD36  <int> 258, 240, 265, 268, 273, 277, 274, 265, 443, 460, 455, 597, 493,…
## $ WD43  <int> 266, 243, 267, 270, 274, 278, 276, 265, 442, 458, 451, 595, 493,…
## $ WD44  <int> 266, 244, 267, 272, 273, 278, 271, 267, 446, 464, 450, 595, 504,…
## $ WD50  <int> 265, 238, 264, 274, 276, 284, 282, 273, 456, 475, 462, 612, 507,…
## $ WD57  <int> 272, 247, 268, 273, 278, 279, 281, 274, 468, 484, 466, 618, 518,…
## $ WD64  <int> 278, 245, 269, 275, 280, 281, 284, 278, 478, 496, 472, 628, 525,…
# Convert to long form
RATSL <- RATS %>%
  gather(key = WD, value = Weight, -ID, -Group) %>%
  mutate(Time = as.integer(substr(WD,3,4)))

glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD     <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …

Observation The wide RATS data set contains 16 rows and 13 columns, of which ID and Group are factors and the rest are integers. The long RATSL data set contains 176 rows and 5 columns: ID and Group are factors, WD is a character, and Weight and Time are integers.
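As a side note, gather() has been superseded in current tidyr; the same wide-to-long conversion can be written with pivot_longer() (an equivalent sketch, not part of the original analysis):

```r
library(dplyr)
library(tidyr)

# Equivalent long-form conversion with pivot_longer()
RATSL_alt <- RATS %>%
  pivot_longer(cols = starts_with("WD"), names_to = "WD", values_to = "Weight") %>%
  mutate(Time = as.integer(substr(WD, 3, 4)))
```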

Visualizing RATSL Dataset

ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID))+
  geom_line() + scale_linetype_manual(values = rep(1:10, times=4))+
  facet_grid(. ~ Group, labeller = label_both)+
  theme_bw() + theme(legend.position = "none")+
  theme(panel.grid.minor.y = element_blank()) +
  scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))

Interpretation The ggplot above demonstrates that the weights of the rats differ between groups: the rats in Group 1 all weigh below 300 grams, in contrast to Groups 2 and 3, where even the lowest weight is above 400 grams. Each group also appears to have an outlier: lower than the rest in Groups 1 and 3 and higher in Group 2. The weights increase over time in all three groups; in Group 1 there is a dip that gradually comes back in line and then increases like the others over the nine-week period.

Standardising the variable weight

RATSL <- RATSL %>%
  group_by(Time) %>%
  mutate(stdweight = (Weight - mean(Weight)) / sd(Weight)) %>%
  ungroup()

# Glimpse the data
glimpse(RATSL)

Observation The RATSL data now has 176 rows and 6 columns, where the 6th column stdweight contains the weight standardized within each time point, (Weight - mean(Weight)) / sd(Weight), stored as a double.

Visualizing standardized RATSL

#Plot again with the standardised RATSL

ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID)) +
  geom_line() + scale_linetype_manual(values = rep(1:10, times=4))+
  facet_grid(. ~ Group, labeller = label_both)+
  theme_bw() + theme(legend.position = "none")+
  theme(panel.grid.minor.y = element_blank())+
  scale_y_continuous(name = "standardised weight")

Standardizing the weights removes the large level differences between the groups, which makes the individual growth profiles easier to compare. Next we create mean and standard deviation summary profiles of the groups to see if they provide a better overview.

Creating average mean and standard Deviation profile of groups

# Number of measurement times
n <- RATSL$Time %>% unique() %>% length()

RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), sd = sd(Weight)) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Glimpse the data
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, …
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2…
## $ sd    <dbl> 15.22158, 13.09307, 11.47591, 13.60081, 11.05748, 11.78377, 10.9…

Observation The new RATSS data set has 33 rows and 4 columns, in which Group is a factor, Time is an integer, and mean and sd are doubles.

Visualizing the new mean growth profile for three groups of RATS

ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  #geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.8,0.8)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

Interpretation According to the mean weight profile plot above, the three groups are clearly distinct from each other. The plot makes this especially clear for Groups 2 and 3, which appeared to overlap for certain values in the previous graphs.

Summary Measure Approach

# Creating summary data by Group and ID with mean as the summary variable.
RATSL1 <- RATSL %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Glimpse the data
glimpse(RATSL1)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, 4…

Observation The new data set RATSL1, created with the mean as the summary variable, has 16 rows and 3 columns, with Group and ID as factors and mean as a double.

Finding Outliers

#Looking at mean
RATSL1$mean
##  [1] 263.2 238.9 261.7 267.2 270.9 276.2 274.6 267.5 443.9 457.5 455.8 594.0
## [13] 495.2 536.4 542.2 536.2
#Visualising through boxplot

ggplot(RATSL1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), weeks 1-8")

Observation The boxplot clearly shows that all three groups have outliers in the mean weight over weeks 1-8. In Group 1 the outlier is the lowest mean, in Group 2 it is the extreme highest, and in Group 3, similar to Group 1, it is the lowest.

Removing outliers

# Create a new data by filtering the outlier and adjust the ggplot code the draw the plot again with the new data
RATSL2 <- filter(RATSL1, mean != 238.9, mean != 594.0, mean != 495.2)
RATSL2$mean
##  [1] 263.2 261.7 267.2 270.9 276.2 274.6 267.5 443.9 457.5 455.8 536.4 542.2
## [13] 536.2
ggplot(RATSL2, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(weight), weeks 1-8")

Observation As expected, the groups become quite distinct once the outliers are removed

T-Test, Linear Model Analysis and ANOVA

Note A t-test cannot be performed here because there are more than two groups; instead we fit a linear model and use ANOVA
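If pairwise group comparisons were still of interest, one option (not part of the original analysis) would be pairwise t-tests with a multiple-testing correction on the summary data:

```r
# Pairwise t-tests between the three diet groups,
# with Bonferroni-adjusted p-values
pairwise.t.test(RATSL1$mean, RATSL1$Group, p.adjust.method = "bonferroni")
```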

# Add the baseline from the original data as a new variable to the summary data
RATSL2 <- RATSL1 %>%
  mutate(baseline = RATS$WD1)

# Fitting the linear model with the mean as response 
fit <- lm( mean ~ Group, data = RATSL2)
summary(fit)
## 
## Call:
## lm(formula = mean ~ Group, data = RATSL2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.900 -27.169   2.325   9.069 106.200 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   265.02      12.98  20.419 2.92e-11 ***
## Group2        222.77      22.48   9.909 2.00e-07 ***
## Group3        262.48      22.48  11.675 2.90e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.71 on 13 degrees of freedom
## Multiple R-squared:  0.9316, Adjusted R-squared:  0.9211 
## F-statistic: 88.52 on 2 and 13 DF,  p-value: 2.679e-08
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Group      2 238620  119310  88.525 2.679e-08 ***
## Residuals 13  17521    1348                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretations The p-value from the ANOVA above is far below 0.05, which means there is a significant difference between the three groups in the RATS growth profiles
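The ANOVA only tells us that the group means are not all equal; to see which pairs of groups differ, a Tukey HSD post-hoc test could be added (a sketch, reusing the summary data from above):

```r
# Post-hoc pairwise comparisons after the one-way ANOVA
TukeyHSD(aov(mean ~ Group, data = RATSL2))
```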

Part 2: Implement the analyses of Chapter 9 of MABS- BPRS data

For this we will use BPRS data set, the description of which can be found in the first part of this chapter above

Reading the Data

#Read the "BPRS" data into R
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", header = TRUE, sep  =" ")

Converting Data to Long form

# Categorical variables to factors

BPRS$treatment <- factor(BPRS$treatment)

BPRS$subject <- factor(BPRS$subject)

glimpse(BPRS)
## Rows: 40
## Columns: 11
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ week0     <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week1     <int> 36, 68, 55, 77, 75, 43, 61, 36, 43, 51, 34, 52, 32, 35, 68, …
## $ week2     <int> 36, 61, 41, 49, 72, 41, 47, 38, 39, 51, 34, 49, 36, 36, 65, …
## $ week3     <int> 43, 55, 38, 54, 65, 38, 30, 38, 35, 55, 41, 54, 31, 34, 49, …
## $ week4     <int> 41, 43, 43, 56, 50, 36, 27, 31, 28, 53, 36, 48, 25, 25, 36, …
## $ week5     <int> 40, 34, 28, 50, 39, 29, 40, 26, 22, 43, 36, 43, 25, 27, 32, …
## $ week6     <int> 38, 28, 29, 47, 32, 33, 30, 26, 20, 43, 38, 37, 21, 25, 27, …
## $ week7     <int> 47, 28, 25, 42, 38, 27, 31, 25, 23, 39, 36, 36, 19, 26, 30, …
## $ week8     <int> 51, 28, 24, 46, 32, 25, 31, 24, 21, 32, 36, 31, 22, 26, 37, …
#Converting to Long form
BPRSL <-  BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)

# Extract the week number
BPRSL <-  BPRSL %>% mutate(week = as.integer(substr(weeks, 5, 5)))

glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Observation The wide data had 40 observations of 11 variables, in which treatment and subject were factors and the rest were integers. In long form, the data set has 360 rows and 5 columns, in which treatment and subject are factors, weeks is a character, and bprs and week are integers.

Visualizing BPRS treatment groups against time.

ggplot(BPRSL, aes(x = week, y = bprs, group = subject, color = treatment)) + geom_text(aes(label = treatment)) + scale_x_continuous(name = "Week", breaks = seq(0, 8, 1)) +scale_y_continuous(name = "BPRS") + theme(legend.position = "top") + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

Interpretation According to the ggplot above, the bprs scores of the first treatment group appear to decline over time, with a somewhat steep decline around week 5, after which they more or less stabilize. The scores in the second group also vary over the course of the study.

Linear Regression Model

BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

Interpretation We fitted a linear regression model with bprs as the target variable and week and treatment as the explanatory variables. The p-value for treatment (0.661) is well above alpha (0.05), suggesting that treatment does not have a significant effect in this model. Note, however, that this simple model ignores the repeated measurements on each subject.

Visualizing Individual Profiles

ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))

Interpretation The plots clearly present declining BPRS scores over the eight-week period. In both Treatment 1 and 2 there are some subjects whose scores declined but started increasing from week 6 onwards. The individual profiles differ considerably at the beginning, and these differences tend to decrease with time.

Linear Mixed Effects Model

A linear mixed effects model takes the longitudinal (repeated-measures) aspect of the data into consideration

#Creating Scatter plot matrix to visualize and understand data again

pairs(BPRS[, 3:11], cex = 0.8)

A. Random Intercept Model

A random intercept model is a linear mixed model (LMM) with a random intercept term for each subject in addition to the residual error, allowing each subject to have a different baseline level. Sometimes, when inference is needed on the original scale and heteroscedasticity must be taken into account, the response variable is transformed to a logarithmic scale, so that the response has a log-normal distribution.

library("lme4")

BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

Interpretation In the statistical model above we fitted the Random Intercept Model with week and treatment as explanatory variables. The fixed-effect coefficients are exactly the same as in the linear regression model, but the standard errors are smaller, reflecting the within-subject correlation that the random intercept accounts for.

B. Random Intercept and Slope Model

Unlike a random intercept model, a random intercept and slope model allows each subject’s line to have a different slope, which means the explanatory variable (time) can have a different effect for each subject

BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)

summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation In this model the fixed-effect coefficients for week and treatment again match the previous models, but the standard errors differ: slightly smaller for treatment, slightly larger for week, and the t-values have decreased by a few units. The ANOVA likelihood-ratio test (p ≈ 0.026) indicates that the random intercept and slope model fits the data significantly better than the random intercept model
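To quantify the uncertainty of the fixed effects directly, confidence intervals can be computed for the mixed model (a sketch; confint() on an lme4 model supports profile, Wald and bootstrap methods):

```r
# 95% confidence intervals for the parameters of the
# random intercept and slope model (Wald method for speed)
confint(BPRS_ref1, method = "Wald")
```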

C. Interaction Model

When performing linear modeling or ANOVA, it is important to check whether the influence of one variable depends on the level of one or more other variables. If so, we have what is referred to as an “interaction”: the factors work together to influence the response.

BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
anova(BPRS_ref1, BPRS_ref2)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4                       
## BPRS_ref2    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation This model gives different results compared to the previous models we just tested. There is a slight increase in the intercept, and the estimate for treatment is now negative, with a positive week:treatment interaction.
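One way to read the interaction coefficients is to combine them into per-group intercepts and slopes. A sketch, hand-copying the fixed-effect estimates from the summary above:

```r
# Fixed effects from summary(BPRS_ref2)
b0 <- 47.8856; b_week <- -2.6283; b_trt2 <- -2.2911; b_int <- 0.7158

# Treatment 1 (reference): intercept b0, slope b_week
# Treatment 2: intercept b0 + b_trt2, slope b_week + b_int
c(intercept_trt2 = b0 + b_trt2, slope_trt2 = b_week + b_int)
# treatment 2 declines by about 1.91 BPRS points per week, vs about 2.63 for treatment 1
```

So the positive interaction does not mean treatment 2 improves the score; it means the decline over weeks is somewhat slower in that group.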

Visualizing the Interaction Model: Observed vs. Fitted
# Add the fitted values from the interaction model as a new column
BPRSL <- BPRSL %>% mutate(Fitted = fitted(BPRS_ref2))

ggplot(BPRSL, aes(x = week, y = bprs, col = subject)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "Week") +
  scale_y_continuous(name = "BPRS") +
  theme(legend.position = "bottom") +
  ggtitle("Observed") 

ggplot(BPRSL, aes(x = week, y = Fitted, col = subject)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "Week") +
  scale_y_continuous(name = "BPRS") +
  theme(legend.position = "bottom") +
  ggtitle("Fitted") 

Interpretation

From the two plots above we can see that the interaction model matches the observed data reasonably well. The fitted plot is clean but smooths away much of the variation, whereas in the observed plot we can clearly see the week-to-week changes. In the observed plot the treatment groups are fairly distinct and easy to read as well.

---- End of Chapter 6 ----